Linear Regression: Basics

machine learning
r
statistics
My take on machine learning based on statistics
Author

Karl Marquez

Published

August 20, 2025

Modified

August 27, 2025

To tackle linear regression, let's set up the environment and load the Advertising data set:

Show me the code
packages <- c("ISLR2", "tidyverse", "ggplot2", "gridExtra", "plotly",
              "scatterplot3d")
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}
# Loads every package in `packages`; separate library() calls are not needed
invisible(lapply(packages, library, character.only = TRUE))

karl_theme <- theme_bw() +
  theme(plot.title = element_text(size=20),
        axis.title = element_text(size = 15),
        axis.text = element_text(size = 12),
        legend.title = element_text(size=12),
        legend.text = element_text(size=12))

advertising <- read.csv("Advertising.csv")

Using the Advertising data set, below is a scatterplot of each of the three advertising media (inputs) against sales (output).

Show me the code
p1 <- ggplot(data = advertising, aes(x = TV, y = sales)) +
        geom_point(alpha = 0.75, size = 2, color = "steelblue") +
        geom_smooth(method = "lm") +
        karl_theme
p2 <- ggplot(data = advertising, aes(x = radio, y = sales)) +
        geom_point(alpha = 0.75, size = 2, color = "steelblue") +
        geom_smooth(method = "lm") +
        karl_theme
p3 <- ggplot(data = advertising, aes(x = newspaper, y = sales)) +
        geom_point(alpha = 0.75, size = 2, color = "steelblue") +
        geom_smooth(method = "lm") +
        karl_theme
grid.arrange(p1, p2, p3, ncol = 3)

TV, radio, and newspaper all seem to have a positive association with sales: the more the company spends on TV advertising, the higher the associated sales. Newspaper appears to be the weakest advertising medium.

Some questions that linear regression can answer:

  1. Is there a relationship between advertising budget and sales?
  2. How strong is the relationship between advertising budget and sales?
  3. Which media are associated with sales?
  4. How large is the association between each medium and sales?
  5. How accurately can we predict future sales?
  6. Is the relationship linear?
  7. Is there synergy among the advertising media?

Simple Linear Regression

Simple linear regression is a method of predicting a quantitative response \(Y\) on the basis of a single predictor variable \(X\), assuming the relationship between \(X\) and \(Y\) is linear. It is given by the equation \[ Y \approx \beta_0 + \beta_1 X \] In our advertising data set, \(Y\) is sales, and \(X\) can be TV. We can regress sales onto TV by fitting the model \[ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV} \] Here, the unknown constants, or coefficient parameters, are \(\beta_0\) (intercept) and \(\beta_1\) (slope). We estimate these values from the training data set. Once we have the estimated coefficients, we can predict future sales with the equation \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \] The hat symbol, ^, denotes an estimated value of an unknown parameter or coefficient, or the predicted value of the response.
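The fit-then-predict workflow above can be sketched with `lm()`. The data here are simulated with known coefficients (a stand-in for the post's Advertising.csv, which may not be available):

```r
# Simulate data from a known linear model: true beta0 = 7, beta1 = 0.05
set.seed(42)
n     <- 200
tv    <- runif(n, 0, 300)                     # hypothetical TV budgets (in $1,000s)
sales <- 7 + 0.05 * tv + rnorm(n, sd = 2)

fit <- lm(sales ~ tv)    # least squares fit of sales onto tv
coef(fit)                # beta0-hat (intercept) and beta1-hat (slope)

# Predict with y-hat = beta0-hat + beta1-hat * x for a new budget
predict(fit, newdata = data.frame(tv = 150))
```

Because the data were generated from a known model, the estimated coefficients land close to the true values of 7 and 0.05.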

Estimating the Coefficients

The question is how do we calculate the coefficient parameters using the training data set? We want to find the \(\beta_0\) (intercept) and \(\beta_1\) (slope) such that the resulting line is as close as possible to the data points. How do we measure this closeness? The most common approach is the least squares criterion. To illustrate, see below:

Show me the code
advertising <- advertising %>% 
  mutate(res = residuals(lm(sales ~ TV)))  # residuals from the least squares fit
ggplot(data = advertising, aes(x = TV, y = sales)) +
  geom_point(alpha = 0.75, size = 2.5, color = "steelblue") +
  geom_smooth(method = lm, color = "blue") +
  geom_segment(aes(xend = TV, yend = sales - res), color = "red", alpha = 0.75) +
  karl_theme

The residual, \(e_i\), is \(e_i = y_i - \hat{y}_i\), which is the difference between the \(i\)th observed response value and the \(i\)th response value predicted by the linear model. The residual sum of squares (RSS) is then \[ RSS = e_1^2 + e_2^2 + \cdots + e_n^2 \] The figure above displays the linear regression fit to the Advertising data, where \(\hat{\beta}_0\) = 7.03 and \(\hat{\beta}_1\) = 0.0475. In simple words, an additional $1,000 spent on TV advertising is associated with selling approximately 47.5 additional units of the product.
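The residuals and the RSS can be computed by hand and checked against `residuals()`. This sketch again uses simulated data rather than Advertising.csv:

```r
# Simulated data with a linear signal plus noise
set.seed(1)
x <- runif(100, 0, 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)

e   <- y - fitted(fit)   # e_i = y_i - y_hat_i
rss <- sum(e^2)          # RSS = e_1^2 + e_2^2 + ... + e_n^2

all.equal(rss, sum(residuals(fit)^2))   # the same quantity via residuals()
```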

Assessing the Accuracy of Coefficients

The true relationship between X and Y is \[ Y = f(X) + \epsilon \] If \(f(X)\) is approximated by a linear function, then the relationship is \[ Y = \beta_0 + \beta_1 X + \epsilon \] where the error term \(\epsilon\) is a catch-all for what we miss with this simple model. This error is independent of \(X\). This equation is also the population regression line, which is the best linear approximation to the true relationship between \(X\) and \(Y\). To illustrate the difference between the population regression line and the least squares line, see below:

  • Left:
    • Red line is the population regression line
    • Blue line is the least squares line
      • the least squares estimate for \(f(X)\) based on observed data
  • Right:
    • Red and blue lines as in the left panel
    • Light blue lines are ten least square lines computed on the basis of a separate random set of observations
      • On average, the least squares lines are close to the population regression line

The least squares lines and the population regression line are different but close, just as a sample mean differs from the population mean yet provides a good estimate of it. This estimation is unbiased.

Important

Unbiased estimation means that, averaged over many data sets, the least squares line (or sample mean) coincides with the population regression line (or population mean). It does not systematically overestimate or underestimate the true parameter.
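Unbiasedness is easy to see by simulation: draw many samples from a model with a known slope, fit least squares to each, and average the estimates. A minimal sketch (all numbers here are made up for illustration):

```r
# Average many least squares slopes from repeated samples of a known model
set.seed(7)
beta1_true <- 3
slopes <- replicate(1000, {
  x <- rnorm(50)
  y <- 2 + beta1_true * x + rnorm(50)
  coef(lm(y ~ x))[2]           # slope estimate from this sample
})

mean(slopes)   # close to 3: no systematic over- or under-estimation
```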

The Standard Error

The accuracy of a single estimate \(\hat{\mu}\) of the population mean \(\mu\) is given by the standard error of \(\hat{\mu}\): \[ \mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n} \] where \(\sigma\) is the standard deviation of each of the realizations \(y_i\) of \(Y\). The standard error tells us the average amount by which \(\hat{\mu}\) differs from the actual \(\mu\), and it shrinks as \(n\) increases. The analogous standard error formulas for the linear regression coefficients are given by:

\[ \text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right) \]

and

\[ \text{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \] Some insights about these equations:

  • \(\sigma^2\) is Var(\(\epsilon\))
  • Standard Error of \(\hat{\beta}_1\) is smaller when \(x_i\) are more spread out
    • we have more leverage to estimate the slope
  • Standard Error of \(\hat{\beta}_0\) would equal Standard Error of \(\hat{\mu}\) if \(\bar{x}\) is zero (\(\hat{\beta}_0\) would be equal to \(\bar{y}\))
  • The estimate of \(\sigma\) is the residual standard error
    • \(\text{RSE} = \sqrt{\frac{\text{RSS}}{n - 2}}\)
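The RSE and the slope's standard error formula can be verified against what `lm()` reports. A sketch on simulated data:

```r
# Simulated data; true sigma is 1.5
set.seed(3)
x <- runif(80, 0, 10)
y <- 1 + 2 * x + rnorm(80, sd = 1.5)
fit <- lm(y ~ x)

n     <- length(y)
rse   <- sqrt(sum(residuals(fit)^2) / (n - 2))  # RSE = sqrt(RSS / (n - 2))
se_b1 <- sqrt(rse^2 / sum((x - mean(x))^2))     # SE(beta1-hat)^2 = sigma^2 / sum((x_i - xbar)^2)

c(rse, summary(fit)$sigma)                        # RSE matches summary()'s sigma
c(se_b1, coef(summary(fit))["x", "Std. Error"])   # matches lm()'s reported SE
```

The manual values agree exactly with `summary(fit)` because `lm()` uses these same formulas.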

Confidence Intervals

  • Can be computed from standard error
  • A 95% confidence interval is defined as a range of values such that with 95% probability the range will contain the true unknown value of the parameter
    • if we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter
  • The range is the lower and upper limits computed from the sample data
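The repeated-sampling interpretation can be checked by simulation: generate many data sets from a model with a known slope and count how often the 95% interval from `confint()` traps the true value (a sketch; the numbers are made up):

```r
# Coverage of 95% confidence intervals for beta1 under a known model
set.seed(11)
beta1 <- 2
covered <- replicate(2000, {
  x <- rnorm(40)
  y <- 1 + beta1 * x + rnorm(40)
  ci <- confint(lm(y ~ x))["x", ]        # lower and upper limits for the slope
  ci[1] <= beta1 && beta1 <= ci[2]       # did this interval capture the truth?
})

mean(covered)   # close to 0.95 by construction
</```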

For linear regression, the 95% confidence interval for \(\beta_1\) is \[ \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1) \]

For \(\beta_0\), the confidence interval is \[ \hat{\beta}_0 \pm 2 \cdot SE(\hat{\beta}_0) \]
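In R, `confint()` gives the exact t-based intervals, and they differ only slightly from the approximate \(\pm 2 \cdot SE\) rule above. A sketch on simulated data:

```r
# Simulated data with known coefficients
set.seed(5)
x <- runif(120, 0, 300)
y <- 7 + 0.05 * x + rnorm(120, sd = 2)
fit <- lm(y ~ x)

confint(fit, level = 0.95)   # exact intervals using qt(0.975, n - 2)

est <- coef(fit)
se  <- coef(summary(fit))[, "Std. Error"]
cbind(lower = est - 2 * se, upper = est + 2 * se)   # the rough +/- 2*SE version
```

The two agree closely because the 97.5% t quantile with 118 degrees of freedom is about 1.98, very near 2.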

In the advertising data, the computed 95% confidence interval for \(\beta_0\) is [6.130, 7.935]. We can conclude that in the absence of any advertising (x = 0), sales will, on average, fall somewhere between 6,130 and 7,935 units.

Also, the 95% confidence interval for \(\beta_1\) is [0.042, 0.053]. We can conclude that for each $1,000 increase in TV advertising, there will be an average increase in sales of between 42 and 53 units.

Hypothesis testing

The most common hypothesis test involves testing the null hypothesis against the alternative hypothesis:

  • the null hypothesis \(H_0\): There is no relationship between \(X\) and \(Y\)
    • \(H_0\): \(\beta_1 = 0\)
  • the alternative hypothesis \(H_a\): There is some relationship between \(X\) and \(Y\)
    • \(H_a\): \(\beta_1 \ne 0\)

If \(\beta_1\) = 0, then equation \(Y = \beta_0 + \beta_1 X + \epsilon\) reduces to \(Y = \beta_0 + \epsilon\), where \(X\) is now not associated with \(Y\). Therefore, to test the null hypothesis, we need to determine whether \(\hat{\beta}_1\) is far from zero. This depends on the SE(\(\hat{\beta}_1\)).

  • if SE(\(\hat{\beta}_1\)) is small, small values of \(\hat{\beta}_1\) may provide strong evidence that \(\beta_1 \ne 0\) (there is a relationship between \(X\) and \(Y\))
  • if SE(\(\hat{\beta}_1\)) is large, then \(\hat{\beta}_1\) must be large in absolute value to reject the null hypothesis

To test the null hypothesis, we compute the t-statistic, given by \[ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \] This t-statistic measures the number of standard deviations that \(\hat{\beta}_1\) is away from 0. The t-distribution has a bell-shaped curve with \(n - 2\) degrees of freedom, and for values of \(n\) greater than approximately 30 it is quite similar to the standard normal distribution.

Using R, we can compute the probability of observing any value equal to \(|t|\) or larger, the p-value.

  • small p-value
    • unlikely to observe such a substantial association between the predictor and response due to chance
    • there is an association between predictor and response
    • we reject the null hypothesis
    • typical cutoffs are 5% and 1%
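The t-statistic and its two-sided p-value can be computed by hand with `pt()` and checked against the summary table from `lm()`. A sketch on simulated data:

```r
# Simulated data with a genuine slope, so the null should be rejected
set.seed(9)
x <- rnorm(60)
y <- 1 + 0.8 * x + rnorm(60)
fit <- lm(y ~ x)
s   <- coef(summary(fit))

t_stat <- s["x", "Estimate"] / s["x", "Std. Error"]   # t = (beta1-hat - 0) / SE(beta1-hat)
p_val  <- 2 * pt(abs(t_stat), df = length(y) - 2, lower.tail = FALSE)

c(t = t_stat, p = p_val)   # matches the "t value" and "Pr(>|t|)" columns of summary(fit)
```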